Analysis For Instagram Data¶
In this project, I will work on Instagram data.
Instagram Data Field Description¶
- Below is a description of each column in the dataset:
Impressions: Total number of views the post received (a measure of reach)
From Home: Views from the home feed (traffic source)
From Hashtags: Views coming through hashtags (hashtag effectiveness)
From Explore: Views from the Explore page (discoverability to new users)
From Other: Views from other sources (e.g. stories, direct shares)
Saves: Number of saves (signals valuable content)
Comments: Number of comments (engagement and discussion)
Shares: Number of times the post was shared (how shareable the content is)
Likes: Number of likes (an engagement metric)
Profile Visits: Visits to the profile from this post (interest in the account)
Follows: New followers gained from the post (a growth metric)
Caption: Text content of the post (can be analyzed for keywords or tone)
Hashtags: Hashtags used in the post (for discoverability and reach)
Questions to Be Answered by the Analysis¶
- Does the view_count affect the number of comments?
- Does the number of followers affect the number of likes?
- Do profile visits lead to more followers?
- Do posts with higher saves also receive more profile visits or follows?
- What is the relationship between impressions and likes, comments, and shares?
- What is the engagement rate per impression for each post?
- Is there a relationship between the number of hashtags used and the impressions from hashtags?
##load needed Modules
import pandas as pd
## display all data columns
pd.options.display.max_columns=None
## load the dataset into DataFrame
df=pd.read_csv(r"C:\Users\DR SYSTEM\Downloads\Instagram data.csv")
## display first rows
df.head(2)
| | Impressions | From Home | From Hashtags | From Explore | From Other | Saves | Comments | Shares | Likes | Profile Visits | Follows | Caption | Hashtags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3920 | 2586 | 1028 | 619 | 56 | 98 | 9 | 5 | 162 | 35 | 2 | Here are some of the most important data visua... | #finance #money #business #investing #investme... |
| 1 | 5394 | 2727 | 1838 | 1174 | 78 | 194 | 7 | 14 | 224 | 48 | 10 | Here are some of the best data science project... | #healthcare #health #covid #data #datascience ... |
## check for dataframe shape
df.shape
(119, 13)
The data has 119 rows and 13 columns.
## check for data info (quality)
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 119 entries, 0 to 118
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype
---  ------          --------------  -----
 0   Impressions     119 non-null    int64
 1   From Home       119 non-null    int64
 2   From Hashtags   119 non-null    int64
 3   From Explore    119 non-null    int64
 4   From Other      119 non-null    int64
 5   Saves           119 non-null    int64
 6   Comments        119 non-null    int64
 7   Shares          119 non-null    int64
 8   Likes           119 non-null    int64
 9   Profile Visits  119 non-null    int64
 10  Follows         119 non-null    int64
 11  Caption         119 non-null    object
 12  Hashtags        119 non-null    object
dtypes: int64(11), object(2)
memory usage: 12.2+ KB
- Some columns may need to be dropped (e.g. ['From Other']).
- Some columns need to be renamed to be more readable.
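One way to make every column name more readable in a single step is to convert them all to snake_case. The sketch below is only an illustration on a small stand-in frame (the values come from the first row shown above); the name `renamed` is hypothetical, and it works on a copy because the cells that follow keep using the original names such as 'From Home'.

```python
import pandas as pd

# Stand-in frame with a few of the dataset's column names (first row of df.head()).
sample = pd.DataFrame({
    "Impressions": [3920],
    "From Home": [2586],
    "Profile Visits": [35],
})

# Work on a copy so code that relies on the original names is unaffected.
renamed = sample.copy()

# Strip stray whitespace, lowercase, and replace spaces with underscores.
renamed.columns = (
    renamed.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
)

print(list(renamed.columns))
```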
Feature Engineering¶
- add 3 columns for the traffic-source shares: home_ratio, hashtag_ratio, explore_ratio
- add engagement ratio columns: like_ratio, comment_ratio, share_ratio, save_ratio
- add a hashtag_count column
- add a profile visit rate column
## copy the dataframe
df_copy=df.copy()
## check for duplicates
df.duplicated().sum()
17
There are 17 duplicated rows.
## drop duplicates
df.drop_duplicates(inplace=True)
#check
df.duplicated().sum()
0
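As a small illustration of what the duplicate count means: `duplicated()` flags only the later copies of a row, so dropping 17 flagged rows from 119 leaves 102. The sketch below uses a tiny stand-in frame (two rows taken from `df.head(2)` plus a made-up repeat of the first one), not the real dataset.

```python
import pandas as pd

# Two rows from df.head(2), plus a made-up exact copy of the first row.
rows = pd.DataFrame({
    "Impressions": [3920, 5394, 3920],
    "Likes": [162, 224, 162],
})

# Default keep='first': only the later copy is flagged.
n_extra_copies = int(rows.duplicated().sum())

# keep=False: every row that has a twin is flagged, including the first one.
n_rows_with_twin = int(rows.duplicated(keep=False).sum())

# Non-inplace variant; reset_index gives a gap-free 0..n-1 index afterwards.
deduped = rows.drop_duplicates().reset_index(drop=True)

print(n_extra_copies, n_rows_with_twin, len(deduped))
```

The notebook itself does not reset the index after dropping duplicates, which is why later outputs still show labels up to 118 with a length of 102.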
#check for null values
df.isnull().sum()
Impressions       0
From Home         0
From Hashtags     0
From Explore      0
From Other        0
Saves             0
Comments          0
Shares            0
Likes             0
Profile Visits    0
Follows           0
Caption           0
Hashtags          0
dtype: int64
## list columns to be renamed
df.columns
Index(['Impressions', 'From Home', 'From Hashtags', 'From Explore',
'From Other', 'Saves', 'Comments', 'Shares', 'Likes', 'Profile Visits',
'Follows', 'Caption', 'Hashtags'],
dtype='object')
# rename desired columns
df.rename(columns={'Impressions':'view_count'}, inplace=True)
# check dataframe
df.head(1)
| | view_count | From Home | From Hashtags | From Explore | From Other | Saves | Comments | Shares | Likes | Profile Visits | Follows | Caption | Hashtags |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3920 | 2586 | 1028 | 619 | 56 | 98 | 9 | 5 | 162 | 35 | 2 | Here are some of the most important data visua... | #finance #money #business #investing #investme... |
## Proportion of view_count coming from Home , hashtags , explore
df['home_ratio']=df['From Home'] / df['view_count']
df['hashtag_ratio']=df['From Hashtags'] / df['view_count']
df['explore_ratio']=df['From Explore'] / df['view_count']
df['home_ratio']
df['hashtag_ratio']
df['explore_ratio']
0 0.157908
1 0.217649
2 0.000000
3 0.205830
4 0.110802
...
114 0.390657
115 0.395393
116 0.330273
117 0.532620
118 0.445408
Name: explore_ratio, Length: 102, dtype: float64
Some posts receive more than 50% of their impressions from the Explore page (for example, row 117 at about 53%).
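The claim above can be checked with a boolean mask. The sketch below uses only the ten `explore_ratio` values printed in the output (rows 0 to 4 and 114 to 118); the elided rows are not included, so the count applies to this visible subset only. On the full data the same mask would be `df['explore_ratio'] > 0.5`.

```python
import pandas as pd

# The ten explore_ratio values visible in the printed output; other rows are elided.
explore_ratio = pd.Series(
    [0.157908, 0.217649, 0.000000, 0.205830, 0.110802,
     0.390657, 0.395393, 0.330273, 0.532620, 0.445408],
    index=[0, 1, 2, 3, 4, 114, 115, 116, 117, 118],
    name="explore_ratio",
)

# True where Explore accounts for more than half of the post's views.
mostly_explore = explore_ratio > 0.5

print(int(mostly_explore.sum()))
print(explore_ratio[mostly_explore])
```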
## add like_ratio, comment_ratio, share_ratio
df['like_ratio']=df['Likes'] / df['view_count']
df['comment_ratio']=df['Comments'] / df['view_count']
df['share_ratio']=df['Shares'] / df['view_count']
df['like_ratio']
df['comment_ratio']
df['share_ratio']
0 0.001276
1 0.002595
2 0.000249
3 0.001546
4 0.001589
...
114 0.002774
115 0.000174
116 0.000242
117 0.002294
118 0.000704
Name: share_ratio, Length: 102, dtype: float64
The share_ratio is very low across all posts: in the values shown above, shares are well under 1% of views.
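The feature list also mentions a save_ratio, which the cell above does not compute. A minimal sketch of how it could be derived in the same way is shown below, using the first five rows of `view_count` and `Saves` as printed in the Q6 output; on the full data this would be `df['Saves'] / df['view_count']`.

```python
import pandas as pd

# First five rows of view_count and Saves, as printed in the Q6 output.
sample = pd.DataFrame({
    "view_count": [3920, 5394, 4021, 4528, 2518],
    "Saves": [98, 194, 41, 172, 96],
})

# Same pattern as the other ratios: saves as a fraction of views.
sample["save_ratio"] = sample["Saves"] / sample["view_count"]

# Row 0: 98 / 3920 = 0.025
print(sample)
```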
## add column (hashtag_count)
def hashtag_count(text):
    # missing captions/hashtags count as zero
    if pd.isna(text):
        return 0
    # every '#' character is counted as one hashtag
    return text.count('#')
df['Hashtag_count']=df['Hashtags'].apply(hashtag_count)
df['Hashtag_count'].head()
0    22
1    18
2    18
3    11
4    29
Name: Hashtag_count, dtype: int64
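Counting '#' characters treats a lone '#' and a repeated hashtag as separate hashtags. If that matters, a slightly stricter variant could count distinct tags with a regular expression. This is only a sketch: `count_unique_hashtags` and the example strings are hypothetical, and the pattern `#\w+` is an assumption about what should count as a hashtag.

```python
import re

import pandas as pd

# A '#' followed by at least one word character (letters, digits, underscore).
HASHTAG_PATTERN = re.compile(r"#\w+")


def count_unique_hashtags(text):
    """Count distinct hashtags, ignoring case; missing values count as 0."""
    if pd.isna(text):
        return 0
    return len({tag.lower() for tag in HASHTAG_PATTERN.findall(text)})


# Made-up examples: a repeated tag, no tags, a missing value, a lone '#'.
examples = pd.Series(["#data #Data #python", "no tags here", None, "lonely # sign #ml"])
counts = examples.apply(count_unique_hashtags).tolist()
print(counts)
```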
## add column profile visit rate
df['profile_visit_rate']=df['Profile Visits'] /df['view_count'] *100
df['profile_visit_rate']
0 0.892857
1 0.889878
2 1.541905
3 0.507951
4 0.317712
...
114 0.532847
115 0.348979
116 0.821454
117 0.452669
118 1.654974
Name: profile_visit_rate, Length: 102, dtype: float64
The profile_visit_rate column expresses profile visits as a percentage of total views for each post.
#check dataframe
df.head(2)
| | view_count | From Home | From Hashtags | From Explore | From Other | Saves | Comments | Shares | Likes | Profile Visits | Follows | Caption | Hashtags | home_ratio | hashtag_ratio | explore_ratio | like_ratio | comment_ratio | share_ratio | Hashtag_count | profile_visit_rate |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3920 | 2586 | 1028 | 619 | 56 | 98 | 9 | 5 | 162 | 35 | 2 | Here are some of the most important data visua... | #finance #money #business #investing #investme... | 0.659694 | 0.262245 | 0.157908 | 0.041327 | 0.002296 | 0.001276 | 22 | 0.892857 |
| 1 | 5394 | 2727 | 1838 | 1174 | 78 | 194 | 7 | 14 | 224 | 48 | 10 | Here are some of the best data science project... | #healthcare #health #covid #data #datascience ... | 0.505562 | 0.340749 | 0.217649 | 0.041528 | 0.001298 | 0.002595 | 18 | 0.889878 |
df.describe().round(2)
| | view_count | From Home | From Hashtags | From Explore | From Other | Saves | Comments | Shares | Likes | Profile Visits | Follows | home_ratio | hashtag_ratio | explore_ratio | like_ratio | comment_ratio | share_ratio | Hashtag_count | profile_visit_rate | engagement rate (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 102.00 | 102.00 | 102.00 | 102.00 | 102.00 | 102.00 | 102.00 | 102.00 | 102.00 | 102.00 | 102.00 | 102.00 | 102.00 | 102.00 | 102.00 | 102.0 | 102.00 | 102.00 | 102.00 | 102.00 |
| mean | 5920.25 | 2496.91 | 1968.28 | 1178.57 | 184.55 | 156.55 | 6.35 | 9.30 | 176.82 | 54.67 | 22.82 | 0.50 | 0.32 | 0.13 | 0.03 | 0.0 | 0.00 | 18.55 | 0.75 | 6.32 |
| std | 5139.89 | 1588.38 | 1977.30 | 2797.21 | 309.10 | 157.77 | 3.31 | 10.15 | 85.15 | 93.17 | 43.69 | 0.17 | 0.18 | 0.14 | 0.01 | 0.0 | 0.00 | 4.80 | 0.59 | 2.05 |
| min | 1941.00 | 1133.00 | 116.00 | 0.00 | 9.00 | 22.00 | 0.00 | 0.00 | 72.00 | 4.00 | 0.00 | 0.10 | 0.03 | 0.00 | 0.01 | 0.0 | 0.00 | 10.00 | 0.16 | 3.05 |
| 25% | 3556.00 | 1923.75 | 753.00 | 178.75 | 40.25 | 70.50 | 4.00 | 3.00 | 122.00 | 16.00 | 4.00 | 0.38 | 0.19 | 0.04 | 0.03 | 0.0 | 0.00 | 17.00 | 0.39 | 4.77 |
| 50% | 4343.50 | 2216.00 | 1326.00 | 337.00 | 75.00 | 111.00 | 6.00 | 6.50 | 157.50 | 24.00 | 8.00 | 0.49 | 0.30 | 0.08 | 0.03 | 0.0 | 0.00 | 18.00 | 0.50 | 6.22 |
| 75% | 6296.25 | 2605.25 | 2415.75 | 728.50 | 218.50 | 173.50 | 8.00 | 13.00 | 208.75 | 45.75 | 18.00 | 0.62 | 0.44 | 0.15 | 0.04 | 0.0 | 0.00 | 20.00 | 0.95 | 7.47 |
| max | 36919.00 | 13473.00 | 11817.00 | 17414.00 | 2547.00 | 1095.00 | 19.00 | 75.00 | 549.00 | 611.00 | 260.00 | 0.92 | 0.74 | 0.70 | 0.05 | 0.0 | 0.01 | 30.00 | 3.17 | 13.03 |
Q1: Does the view_count affect the number of comments?¶
# load needed modules
import seaborn as sns
import matplotlib.pyplot as plt
sns.regplot(data=df, x='view_count' , y='Comments')
plt.title('view_count Vs Comment')
plt.ylabel('Comment')
plt.xlabel('view_count')
plt.show()
There is no strong correlation between the number of views and the number of comments on the posts in this data (Pearson r is about -0.01 in the correlation matrix under Q5).
Q2: Does the number of followers affect the number of likes?¶
# load needed modules
import plotly.express as px
px.scatter(df,x='Follows' , y='Likes' , trendline='ols')
There is a strong positive correlation between the number of likes and the number of new followers (Follows) in this data.
Q3: Do profile visits lead to more followers?¶
# load needed modules
import plotly.express as px
px.scatter(df, x='Profile Visits' , y='Follows' , trendline='ols')
Q4: Do posts with higher saves also receive more profile visits or follows?¶
# load needed modules
import plotly.express as px
px.scatter(df,x='Saves',y='Follows',trendline='ols')
There is a strong positive correlation between the number of saves and the number of new followers.
px.scatter(df,x='Saves',y='Profile Visits',trendline='ols')
There is also a positive but weaker correlation between saves and profile visits.
Q5: What is the relationship between impressions and likes, comments, and shares?¶
# load needed modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
columns=['view_count','Likes','Comments','Shares']
correlation=df[columns].corr()
print(correlation)
sns.heatmap(correlation,annot=True)
plt.title('The relationship between impressions and engagement')
plt.show()
            view_count     Likes  Comments    Shares
view_count    1.000000  0.852952 -0.008535  0.654920
Likes         0.852952  1.000000  0.163383  0.718790
Comments     -0.008535  0.163383  1.000000  0.012697
Shares        0.654920  0.718790  0.012697  1.000000
Strong correlation between view_count and Likes (0.85): posts with more views generally have more likes.
Moderate correlation between view_count and Shares (0.65): posts with more views tend to get more shares.
No meaningful correlation between view_count and Comments (-0.0085): comments do not necessarily increase with views.
Likes and Shares are strongly correlated (0.72): posts that are liked more also tend to be shared more.
Comments are not strongly correlated with anything in this matrix (the largest value is 0.16, with Likes).
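The describe() table shows a maximum view_count of 36919 against a median of 4343.5, so a few very large posts may dominate the Pearson coefficients above. A rank-based (Spearman) correlation is one way to check whether the same pattern holds without that leverage. The sketch below is an illustration on the ten rows printed in the Q6 output only, so its numbers will differ from the full-data matrix; on the full data the equivalent call would be `df[columns].rank().corr()`.

```python
import pandas as pd

# The ten rows visible in the Q6 output (rows 0-4 and 114-118).
sample = pd.DataFrame(
    {
        "view_count": [3920, 5394, 4021, 4528, 2518, 13700, 5731, 4139, 32695, 36919],
        "Likes": [162, 224, 131, 213, 123, 373, 148, 92, 549, 443],
        "Comments": [9, 7, 11, 10, 5, 2, 4, 0, 2, 5],
        "Shares": [5, 14, 1, 7, 4, 38, 1, 1, 75, 26],
    },
    index=[0, 1, 2, 3, 4, 114, 115, 116, 117, 118],
)

# Spearman's rho is the Pearson correlation computed on ranks
# (rank() assigns average ranks to ties by default).
spearman = sample.rank().corr(method="pearson")

print(spearman.round(2))
```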
Q6: What is the engagement rate per impression for each post?¶
# load needed modules
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
df['engagement rate (%)']=((df['Likes']+df['Comments']+df['Saves']+df['Shares'])/df['view_count'])*100
print(df[['view_count','Likes','Comments','Shares','Saves','engagement rate (%)']])
     view_count  Likes  Comments  Shares  Saves  engagement rate (%)
0          3920    162         9       5     98             6.989796
1          5394    224         7      14    194             8.138673
2          4021    131        11       1     41             4.575976
3          4528    213        10       7    172             8.878092
4          2518    123         5       4     96             9.054805
..          ...    ...       ...     ...    ...                  ...
114       13700    373         2      38    573             7.197080
115        5731    148         4       1    135             5.025301
116        4139     92         0       1     36             3.116695
117       32695    549         2      75   1095             5.263802
118       36919    443         5      26    653             3.052629
[102 rows x 6 columns]
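As a worked check of the formula, the sketch below recomputes the engagement rate for the first five rows printed above and sorts them from highest to lowest. It uses only those five rows, so the ranking says nothing about the rest of the dataset.

```python
import pandas as pd

# First five rows from the output above.
sample = pd.DataFrame({
    "view_count": [3920, 5394, 4021, 4528, 2518],
    "Likes": [162, 224, 131, 213, 123],
    "Comments": [9, 7, 11, 10, 5],
    "Shares": [5, 14, 1, 7, 4],
    "Saves": [98, 194, 41, 172, 96],
})

# Total interactions per post, then expressed as a percentage of views.
interactions = sample[["Likes", "Comments", "Shares", "Saves"]].sum(axis=1)
sample["engagement rate (%)"] = interactions / sample["view_count"] * 100

# Row 0: (162 + 9 + 5 + 98) / 3920 * 100 = 274 / 3920 * 100, about 6.99
ranked = sample.sort_values("engagement rate (%)", ascending=False)
print(ranked[["view_count", "engagement rate (%)"]])
```

Among these five rows, the post with the fewest views (row 4) has the highest engagement rate.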
Q7: Is there a relationship between the number of hashtags used and the impressions from hashtags?¶
# load needed modules
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(8, 6))
sns.regplot(data=df, x='Hashtag_count', y='From Hashtags')
plt.title("Relationship Between Hashtag Count and Impressions from Hashtags")
plt.xlabel("Number of Hashtags Used")
plt.ylabel("Impressions from Hashtags")
plt.show()
Do not rely only on the quantity of hashtags; the quality and relevance of the hashtags to your content matter as well.
(It is better to choose more specific and strongly related hashtags.)
Conclusion¶
The raw data has 119 rows and 13 columns; 102 rows remain after dropping 17 duplicates.
There is no strong correlation between the number of views and the number of comments on the posts in this data.
There is a strong positive correlation between the number of likes and the number of new followers.
There is a strong positive correlation between the number of saves and the number of new followers.
There is also a positive but weaker correlation between saves and profile visits.
Strong correlation between view_count and Likes (0.85): posts with more views generally have more likes.
Moderate correlation between view_count and Shares (0.65): posts with more views tend to get more shares.
No meaningful correlation between view_count and Comments (-0.0085): comments do not necessarily increase with views.
Likes and Shares are strongly correlated (0.72): posts that are liked more also tend to be shared more.
Comments are not strongly correlated with anything in this matrix.
Do not rely only on the quantity of hashtags; the quality and relevance of the hashtags to your content matter as well.
(It is better to choose more specific and strongly related hashtags.)